field: force-inline 5x52 mul and sqr #1859
Ran the benchmarks on an i9-14900HX, built with GCC 12.3, and confirmed the speedups.
Results (Min us, lower is better):
Seeing a ~12.5% speedup for both ECDSA and Schnorr verification and ~3% for signing on my arm64 machine (Snapdragon X Elite - X1E-78-100), using GCC 14.2.0:
master:
PR:
Applying this change to the Bitcoin Core secp256k1 subtree (branch apply-secp-pr1859) shows the speedup in the script verification benchmarks as well (run via
master (commit theStack/bitcoin@654a522):
PR applied (commit theStack/bitcoin@494a473):
Concept ACK
That's a very interesting observation. So far, we have tried to stay away from guiding the compiler too much, but the ratio of added complexity to gains here is pretty good.
@l0rinc What I always wanted to try is profile-guided optimization, e.g., where the profile is generated in a benchmark run that only performs signature verification (this could even be done automatically as part of the build process). I imagine there could be more low-hanging fruit. Would you be interested in looking into this as well?
The 5x52 field multiplication and squaring routines are hot in group arithmetic and scalar multiplication. Use the new `SECP256K1_FORCE_INLINE` for the thin wrappers and `int128` inner helpers so compilers can schedule the 64x64->128 arithmetic without a call boundary.

The helper uses forced inlining in optimized release-style builds, but falls back to `SECP256K1_INLINE` when no-inline, size optimization, or debug-style macros ask not to force it.

Across the measured GCC and MSVC Release builds, this improves ECDSA verification by 0.6% to 9.1%, ECDH by 0.7% to 9.3%, and Schnorr verification by 0.6% to 9.6%. The direct field benchmarks generally show the intended effect on field squaring and multiplication, while Clang results are mostly flat and less consistently positive.

This is a code-size tradeoff: the tested static library builds grew by about 4.6% to 4.7%, and the tested Windows Release DLL grew by 14.1%.

Co-authored-by: Sebastian Falbesoner <sebastian.falbesoner@gmail.com>
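The fallback behaviour described in this commit message can be sketched as a small macro ladder. This is an illustration under assumptions, not the PR's code: the `DEMO_*` macros and `demo_*` functions are hypothetical names, and the exact conditions in the real header may differ.

```c
#include <stdint.h>

/* Hypothetical stand-in for the existing plain inline helper. */
#if defined(_MSC_VER)
#  define DEMO_INLINE __inline
#else
#  define DEMO_INLINE inline
#endif

/* Force inlining only when no debug, no-inline, or size-optimization
 * macro is present; otherwise fall back to the plain inline spelling. */
#if !defined(_DEBUG) && !defined(__NO_INLINE__) && !defined(__OPTIMIZE_SIZE__)
#  if defined(_MSC_VER)
#    define DEMO_FORCE_INLINE __forceinline
#  elif defined(__GNUC__)
#    define DEMO_FORCE_INLINE DEMO_INLINE __attribute__((always_inline))
#  else
#    define DEMO_FORCE_INLINE DEMO_INLINE
#  endif
#else
#  define DEMO_FORCE_INLINE DEMO_INLINE
#endif

/* Hypothetical inner helper: low 64 bits of a product. */
static DEMO_FORCE_INLINE uint64_t demo_mul_lo(uint64_t a, uint64_t b) {
    return a * b;
}

/* Hypothetical thin wrapper: with forced inlining, no call to
 * demo_mul_lo should remain in the generated code for this function. */
uint64_t demo_sqr_lo(uint64_t a) {
    return demo_mul_lo(a, a);
}
```

Calling `demo_sqr_lo(3)` returns 9 in every configuration; only the generated code changes, which is why the effect has to be confirmed with benchmarks or by inspecting the disassembly.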
Concept ACK.
Master:
This PR:
(GCC 15.2.0 on Ryzen 5950X)
ac915c9 to 1c537ab
# define SECP256K1_INLINE inline
# endif

# if !defined(_DEBUG) && !defined(__NO_INLINE__) && !defined(__OPTIMIZE_SIZE__)
Because Microsoft's cl.exe defines neither the __OPTIMIZE_SIZE__ nor the __OPTIMIZE__ macro, building with cmake --build build --config MinSizeRel will still result in __forceinline being used.
Yes, see #1859 (comment)
Do you think we should change anything here?
We could manually define __OPTIMIZE_SIZE__ for the "MinSizeRel" configuration on Windows in the build system. It's not a huge deal, though, since we recommend using clang-cl.exe for Windows builds anyway.
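One way to check what the guard sees in a given build configuration is to probe the predefined macros directly. The sketch below is only an illustration: the `DEMO_SEES_*` flags and `demo_*` functions are hypothetical, and only the three predefined macro names come from the diff under review.

```c
/* Bit flags for the macros that the guard in the diff tests. */
enum {
    DEMO_SEES_DEBUG         = 1 << 0, /* _DEBUG */
    DEMO_SEES_NO_INLINE     = 1 << 1, /* __NO_INLINE__ */
    DEMO_SEES_OPTIMIZE_SIZE = 1 << 2  /* __OPTIMIZE_SIZE__ */
};

/* Report which guard macros this translation unit was compiled with. */
int demo_guard_macros(void) {
    int seen = 0;
#if defined(_DEBUG)
    seen |= DEMO_SEES_DEBUG;
#endif
#if defined(__NO_INLINE__)
    seen |= DEMO_SEES_NO_INLINE;
#endif
#if defined(__OPTIMIZE_SIZE__)
    seen |= DEMO_SEES_OPTIMIZE_SIZE;
#endif
    return seen;
}

/* Mirrors the guard: forced inlining is selected only if none is set. */
int demo_guard_allows_force_inline(void) {
    return demo_guard_macros() == 0;
}
```

Based on the comment above, a cl.exe MinSizeRel build would be expected to report no `__OPTIMIZE_SIZE__` flag here, so the guard would still select forced inlining unless the build system defines the macro itself.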
hebasto left a comment:
The Bitcoin Core project has a similar macro named ALWAYS_INLINE. Could we adopt SECP256K1_ALWAYS_INLINE here for consistency across the two closely related projects?
# endif

# if !defined(_DEBUG) && !defined(__NO_INLINE__) && !defined(__OPTIMIZE_SIZE__)
# if defined(_MSC_VER)
Slightly unrelated (?) to this specific change:
On Windows "Release" builds, both cl.exe and clang-cl.exe hit this branch. However, for the SECP256K1_INLINE macro above, cl.exe uses the Microsoft-specific __inline while clang-cl.exe uses the standard inline.
We should probably make clang-cl.exe handle SECP256K1_INLINE and SECP256K1_FORCE_INLINE consistently, choosing either the MSVC extensions or the standard keywords for both.
Problem: The 5x52 field multiplication and squaring routines are hot in group arithmetic and scalar multiplication. Some compilers leave the thin wrappers and int128 inner helpers out of line, which keeps a call boundary in this hot path and limits scheduling of the 64x64->128 arithmetic.
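For readers unfamiliar with the arithmetic in question, a 64x64->128 multiply and a two-term accumulation can be sketched as follows. This is a minimal illustration, not the library's `int128` helpers: `demo_u128`, `demo_mul64`, and `demo_muladd2` are hypothetical names, and the `unsigned __int128` path assumes a GCC-compatible compiler.

```c
#include <stdint.h>

/* Hypothetical 128-bit result split into two 64-bit halves. */
typedef struct { uint64_t hi, lo; } demo_u128;

/* Full 64x64->128 product. */
static inline demo_u128 demo_mul64(uint64_t a, uint64_t b) {
    demo_u128 r;
#if defined(__SIZEOF_INT128__)
    /* GCC/Clang: let the compiler emit the widening multiply. */
    unsigned __int128 p = (unsigned __int128)a * b;
    r.lo = (uint64_t)p;
    r.hi = (uint64_t)(p >> 64);
#else
    /* Portable fallback: schoolbook multiplication on 32-bit halves. */
    uint64_t a0 = a & 0xFFFFFFFFu, a1 = a >> 32;
    uint64_t b0 = b & 0xFFFFFFFFu, b1 = b >> 32;
    uint64_t p00 = a0 * b0, p01 = a0 * b1, p10 = a1 * b0, p11 = a1 * b1;
    uint64_t mid = (p00 >> 32) + (p01 & 0xFFFFFFFFu) + (p10 & 0xFFFFFFFFu);
    r.lo = (mid << 32) | (p00 & 0xFFFFFFFFu);
    r.hi = p11 + (p01 >> 32) + (p10 >> 32) + (mid >> 32);
#endif
    return r;
}

/* Accumulate a*b + c*d modulo 2^128, as one column of a limb product
 * would. If demo_mul64 stays out of line, each term costs a call. */
demo_u128 demo_muladd2(uint64_t a, uint64_t b, uint64_t c, uint64_t d) {
    demo_u128 x = demo_mul64(a, b);
    demo_u128 y = demo_mul64(c, d);
    demo_u128 r;
    r.lo = x.lo + y.lo;
    r.hi = x.hi + y.hi + (r.lo < x.lo); /* carry out of the low half */
    return r;
}
```

A field multiplication chains many such multiply-accumulate steps, which is why a call boundary around the small helper can limit how the compiler schedules them.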
Fix: Define `SECP256K1_FORCE_INLINE` next to the existing inline helper and use it for the 5x52 multiplication and squaring wrappers and `int128` inner helpers.

For default optimized builds, this expands to `__forceinline` on MSVC-compatible compilers and to `__attribute__((always_inline))` on GCC-compatible compilers. It falls back to the existing inline spelling when inlining is disabled, when optimization is disabled, when optimizing for size on GCC/Clang, or when `_DEBUG` is defined.

Benchmarks: Values are relative changes in `Min(us)`, lower is better.

Tradeoffs: The speedups reproduce most consistently with GCC and MSVC. Clang was less consistently positive.

Inlining also increases code size:
libsecp256k1.a
libsecp256k1.a
libsecp256k1-*.dll

Linux benchmarking script
Linux size comparison script
host: M4-Max.local, compiler: gcc-14 (Homebrew GCC 14.3.0) 14.3.0
host: WIN-A2EHOAU4JET (Intel(R) Xeon(R) CPU E5-2637 v2 @ 3.50GHz), system: Microsoft Windows NT 10.0.20348.0, compiler: Microsoft (R) C/C++ Optimizing Compiler Version 19.50.35728 for x64
host: i9-ssd, compiler: gcc (GCC) 16.1.0
host: i7-hdd, compiler: gcc (Ubuntu 14.2.0-19ubuntu2) 14.2.0
host: rpi5-16-3, compiler: gcc (Ubuntu 14.2.0-19ubuntu2) 14.2.0
host: rpi4-2-1, compiler: gcc (Ubuntu 14.2.0-19ubuntu2) 14.2.0
host: umbrel (Intel(R) N150), compiler: gcc (Debian 12.2.0-14+deb12u1) 12.2.0
host: nodl (Cortex-A53), compiler: gcc (Ubuntu 11.4.0-1ubuntu1~22.04.3) 11.4.0
Reviewer measurements
andrewtoth, i9-14900HX, GCC 12.3
theStack, Snapdragon X Elite X1E-78-100, GCC 14.2.0
Bitcoin Core subtree
bench_bitcoin -filter=VerifyScript.*:
sipa, Ryzen 5950X, GCC 15.2.0
clang:
host: i9-ssd, compiler: Ubuntu clang version 22.1.6 (++20260508084839+c0262e742787-1~exp1~20260508204859.77)
reindex-chainstate:
2026-05-28 | reindex-chainstate | 950059 blocks | dbcache 5000 | i9-ssd | x86_64 | Intel(R) Core(TM) i9-9900K CPU @ 3.60GHz | 16 cores | 62Gi RAM | SSD